Random Effects Models and Panel Data

Autocorrelation

Possible fixes (that don’t require throwing out data)

  • Clustered standard errors: use either the bootstrap or a sandwich estimator to adjust the standard errors upward

  • Fixed effects: make a dummy for each group and include it in the model, (potentially) eliminating the autocorrelation while also controlling for unobserved differences across groups

  • Random effects
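The clustered-standard-error fix can be sketched like this, using simulated data and hypothetical variable names (the sandwich and lmtest packages provide the estimator):

```r
library(sandwich)  # vcovCL: cluster-robust covariance estimator
library(lmtest)    # coeftest: re-test coefficients with a custom vcov

# Simulated clustered data: 10 groups, each with its own intercept shift
set.seed(1)
g <- rep(letters[1:10], each = 20)
x <- rnorm(200)
y <- 2 * x + rnorm(10)[match(g, letters[1:10])] + rnorm(200)
df <- data.frame(g, x, y)

m <- lm(y ~ x, data = df)

# Standard errors clustered by group; typically larger than the naive ones
coeftest(m, vcov = vcovCL(m, cluster = ~ g))
```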

“Random” vs. Fixed

(the terminology here is inconsistent and contested, but no one has come up with better words yet)

  • Fixed Effects Models give every group its own y-intercept. The effects of “Texas” vs. “Tennessee” are assumed to be fixed and unrelated.

    • In this sense, you’re usually assuming a fixed effect any time you run a standard OLS model.
  • Random Effects Models assume that group intercepts are random draws from a common distribution. They can vary just like any other random variable, but they vary in a predictable way.

An example: students in classrooms

  • You want to estimate the effect of time spent studying on student achievement

  • But you also want to account for differences between teachers: you suspect some are slightly better than others.

flowchart LR
Z[School] --> A
Z --> G
Z --> L
A[Classroom 1] --> B(Student A)
A --> C(Student B)
A --> E(Student D)
G[Classroom 2] --> H(Student H)
G --> I(Student I)
G --> K(Student K)
L[Classroom 3] --> M(Student M)

An example: students in classrooms

You calculate some averages and your results look like this:

Instructor   Number of students   Average SAT score
Teacher 1    50                   1029
Teacher 2    25                   1020
Teacher 3    5                    1120

Under a fixed effects assumption, each teacher’s effect is a fixed parameter, and knowing the mean score for teacher 1 and 2 doesn’t tell you anything about what the mean score should be for teacher 3.

Under a random effects assumption, the teacher effect is drawn from a larger distribution that has a central tendency and a variance. So you might look at the value for teacher 3 and conclude: “this seems implausible; in reality, this effect should probably be closer to the effect for the other two.”

An example: students in classrooms

So we could assume something like this:

\[ SAT = \text{teacher effect} + b_\text{studying effect} \times \text{study time} + \epsilon \] \[ \epsilon \sim \mathcal{N}(0, \sigma) \]

But we could instead assume a model in which the “teacher effect” also has a normal distribution, like the error term:

\[ SAT = (\text{teacher effect}) + b_\text{studying effect} \times \text{study time} + \epsilon\]

\[ \text{teacher effect} \sim \mathcal{N}(0, \tau) \]

In the second case, we estimate a model that slightly shrinks the teacher effects towards the grand mean.

Random Effects: pooling

One way to think about a random effect is that it represents a compromise between “no pooling” and “complete pooling” of regression results.

  • “No pooling” would mean estimating a totally separate model for each teacher

  • “Complete pooling” would mean estimating a model that totally ignores teacher effects.

  • “Partial pooling” would mean estimating a model that uses a weighted average of these cases
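A minimal sketch of all three approaches, using simulated teacher data (names and numbers are made up):

```r
library(lme4)

# Simulated data: 3 teachers with very different class sizes
set.seed(42)
teacher <- rep(c("t1", "t2", "t3"), times = c(50, 25, 5))
score <- 1000 + c(t1 = 10, t2 = -10, t3 = 90)[teacher] + rnorm(80, sd = 50)
toy <- data.frame(teacher, score)

complete_pool <- lm(score ~ 1, data = toy)            # ignore teachers entirely
no_pool       <- lm(score ~ 0 + teacher, data = toy)  # one intercept per teacher
partial_pool  <- lmer(score ~ 1 + (1 | teacher), data = toy)  # random intercepts

# The random-intercept estimates sit between the no-pooling means and the
# overall mean, with more shrinkage for the small group (t3)
coef(no_pool)
coef(partial_pool)$teacher
```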

Random Effects: pooling

A partial pooling estimate favors the “no pooling” estimate if group \(j\) is large and low-variance, and favors the “complete pooling” estimate if group \(j\) is small or high-variance.

\[ \text{Estimate of random intercept } \alpha_j \approx \frac{\frac{n_j}{\sigma^2_y}}{\frac{n_j}{\sigma^2_y} + \frac{1}{\sigma^2_\alpha}} (\bar{y}_j - \beta\bar{x}_j) + \frac{\frac{1}{\sigma^2_\alpha}}{\frac{n_j}{\sigma^2_y} + \frac{1}{\sigma^2_\alpha}} \mu_\alpha \]
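Plugging hypothetical numbers for the small class from the earlier table (n_j = 5, raw mean 1120, no covariates) into this weighted average shows the shrinkage; the variance values here are made up for illustration:

```r
# All variance values below are invented for illustration
n_j      <- 5        # group size
sigma2_y <- 50^2     # within-group (residual) variance
sigma2_a <- 20^2     # variance of the group intercepts
mu_a     <- 1025     # grand mean of the intercepts
ybar_j   <- 1120     # group's raw mean (no covariates, so beta * xbar_j = 0)

w <- (n_j / sigma2_y) / (n_j / sigma2_y + 1 / sigma2_a)
alpha_j <- w * ybar_j + (1 - w) * mu_a
alpha_j  # about 1067: pulled well away from 1120, toward the grand mean 1025
```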

Random Effects: pooling

  • Random effects models will often give more plausible estimates for cases with small sample sizes - because they “pool” extreme values with small samples.

    • Think of it like estimating a batting average for a player with 2 at-bats. Is it plausible they’re batting 1.000? Or should we assume they’re around the league average until we get more information?

Random Effects

library(fixest)
library(modelsummary)
# county-level election results

fmodel<-feols(perc_gop ~ Income + White + AfAm + Hisp  | state_name, data=counties)
regmodel<-feols(perc_gop ~ Income + White + AfAm + Hisp , 
                data =counties, cluster = ~state_name)

mlist<-list("No fixed effects" = regmodel, 
            'Fixed effects' = fmodel)
modelsummary(mlist, 
             estimate  = "{estimate}",  
             statistic = c("conf.int", 'std.error'),
             conf_level = .95,
             coef_rename = coefnames,
             coef_omit ='Intercept',
             gof_map = c('nobs','aic','bic', 'vcov.type',
                         'FE: state_name'
             )
)
                       No fixed effects   Fixed effects
Median income ($1000)  -0.374             -0.257
                       [-0.444, -0.304]   [-0.308, -0.206]
                       (0.035)            (0.025)
% White                0.574              0.592
                       [0.399, 0.748]     [0.525, 0.659]
                       (0.087)            (0.034)
% African American     -0.047             -0.249
                       [-0.247, 0.153]    [-0.362, -0.135]
                       (0.100)            (0.057)
% Hispanic             0.427              0.275
                       [0.181, 0.673]     [0.152, 0.398]
                       (0.122)            (0.061)
Num.Obs.               3115               3115
AIC                    24166.5            22175.2
BIC                    24196.7            22507.6
Std.Errors             by: state_name     by: state_name
FE: state_name                            X

Random Effects

To specify a random effect, we can use the lme4 package. The (1|state_name) term says that we want a random intercept for each state:

library(lme4)
# fixed effects version for comparison:
fixefmodel<-fixest::feols(perc_gop ~ Income + White + AfAm + Hisp  | state_name, 
                          data=counties)

ranefmodel<-lmer(perc_gop ~ Income + White + AfAm + Hisp + (1|state_name),
               data=counties
               )

Random Effects

                       No fixed effects   Fixed effects      Random effects
Median income ($1000)  -0.374             -0.257             -0.260
                       [-0.444, -0.304]   [-0.308, -0.206]   [-0.281, -0.239]
                       (0.035)            (0.025)            (0.011)
% White                0.574              0.592              0.588
                       [0.399, 0.748]     [0.525, 0.659]     [0.543, 0.633]
                       (0.087)            (0.034)            (0.023)
% African American     -0.047             -0.249             -0.249
                       [-0.247, 0.153]    [-0.362, -0.135]   [-0.305, -0.194]
                       (0.100)            (0.057)            (0.028)
% Hispanic             0.427              0.275              0.272
                       [0.181, 0.673]     [0.152, 0.398]     [0.219, 0.326]
                       (0.122)            (0.061)            (0.027)
SD (Observations)                                            8.430
Num.Obs.               3115               3115               3115
AIC                    24166.5            22175.2            22370.0
BIC                    24196.7            22507.6            22412.3
Std.Errors             by: state_name     by: state_name
FE: state_name                            X

Random Effects: additional covariates

With fixed effects models, we can’t include controls for group-level covariates, because they’re perfectly collinear with the group dummies:

library(rvest)
library(tidyverse)  # dplyr, stringr, tidyr for the wrangling below
library(lubridate)

page<-read_html('https://en.wikipedia.org/wiki/List_of_U.S._states_by_date_of_admission_to_the_Union')

join_date<-page|>
  html_element(css='table')|>
  html_table()|>
  select(2:4)|>
  mutate(datestring = str_extract(
    `Date(admitted or ratified)`, "([A-Z][a-z]+ [0-9]{1,2}, [0-9]{4})"),
         date = mdy(datestring)
  )
counties<-counties|>
  left_join(join_date, by=join_by(state_name==State))|>
  mutate(join_year = year(date) - 1787)|>
  drop_na(join_year)

fixefmodel<-fixest::feols(perc_gop ~ Income + White + AfAm + Hisp + join_year  | state_name, 
                          data=counties)

fixefmodel
OLS estimation, Dep. Var.: perc_gop
Observations: 3,114
Fixed-effects: state_name: 50
Standard-errors: Clustered (state_name) 
        Estimate Std. Error   t value   Pr(>|t|)    
Income -0.257136   0.025502 -10.08291 1.5294e-13 ***
White   0.591932   0.033514  17.66219  < 2.2e-16 ***
AfAm   -0.248722   0.056598  -4.39450 5.9451e-05 ***
Hisp    0.274869   0.061121   4.49710 4.2332e-05 ***
... 1 variable was removed because of collinearity (join_year)
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
RMSE: 8.35604     Adj. R2: 0.713165
                Within R2: 0.567463

Random Effects: additional covariates

But random effects models can still estimate effects for group-level characteristics:

ranefmodel<-lmer(perc_gop ~ Income + White + AfAm + Hisp + join_year + (1|state_name),
               data=counties
               )

summary(ranefmodel)
Linear mixed model fit by REML ['lmerMod']
Formula: perc_gop ~ Income + White + AfAm + Hisp + join_year + (1 | state_name)
   Data: counties

REML criterion at convergence: 22348.1

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-5.1757 -0.4745  0.0931  0.5828  3.3831 

Random effects:
 Groups     Name        Variance Std.Dev.
 state_name (Intercept) 113.01   10.63   
 Residual                71.06    8.43   
Number of obs: 3114, groups:  state_name, 50

Fixed effects:
            Estimate Std. Error t value
(Intercept) 31.27089    3.36819   9.284
Income      -0.25919    0.01074 -24.124
White        0.59241    0.02327  25.463
AfAm        -0.24204    0.02853  -8.482
Hisp         0.27559    0.02739  10.061
join_year    0.06814    0.03294   2.068

Correlation of Fixed Effects:
          (Intr) Income White  AfAm   Hisp  
Income    -0.276                            
White     -0.693  0.072                     
AfAm      -0.630  0.154  0.830              
Hisp      -0.584  0.061  0.816  0.691       
join_year -0.576  0.018  0.095  0.105  0.058

Random Effects: random slopes

We can also estimate random slopes using this same basic approach.

ranefmodel<-lmer(perc_gop ~  White + AfAm + Hisp + join_year + (Income |state_name),
               data=counties
               )

library(marginaleffects)
nd<-datagrid(model=ranefmodel, 
             state_name = c("Alabama", "Wyoming", "Maryland", "Virginia", "Nevada", "North Carolina"), 
             "Income" = seq(17, 170, by=20))

plot_predictions(ranefmodel, newdata=nd, by=c('Income', 'state_name')) +
  theme_bw() +
  labs(color ='State', fill='State',  y='% GOP vote')

Random Effects: random slopes

What to use

  • Fixed Effects models are best when:
    • You want to control for un-modeled differences across groups
    • You think groups are fundamentally “different”
    • You don’t need to estimate the effects of any group-level characteristics
  • Random Effects models are preferable when:
    • There are lots of small groups with very few members
    • You’re less worried about unmeasured heterogeneity between clusters
    • You want to include characteristics across different “levels”, or even allow slopes to vary across groups

Temporal autocorrelation

  • OLS assumes that observations are independent, but your height today is not independent of your height yesterday.
  • Autocorrelation: a variable correlates with itself across observations.
    • Temporal autocorrelation is when a current value depends on a prior value.
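A quick simulated illustration: in an AR(1) process, each value is a fraction (here 0.8) of the prior value plus noise, so the series correlates strongly with its own lag.

```r
set.seed(1)
n <- 500
x <- numeric(n)
for (t in 2:n) x[t] <- 0.8 * x[t - 1] + rnorm(1)

# correlation between the series and its one-period lag
cor(x[-1], x[-n])  # close to 0.8 by construction
```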

Problem: autocorrelation

Current values often depend on prior values. The democracy score of Paraguay in 2000 is a good predictor of current democracy scores in Paraguay.

vdem<-vdemdata::vdem|>
  filter(e_regionpol == "2")|>
  select(country_name, year, v2x_polyarchy, e_gdppc, e_pop)

vdem|>
  filter(country_name=='Paraguay')|>
ggplot(aes(x=year, y=v2x_polyarchy)) +
  geom_line()  +
  facet_wrap(~country_name) +
  theme_minimal()

Problem: autocorrelation

Lack of independence poses similar problems to what we observed with clustered samples: we will generally overstate our level of confidence in estimates. We actually have very few truly independent observations here.

model<-vdem|>
  filter(country_name=='Paraguay')|>
  filter(year>=1995)|>
  lm(v2x_polyarchy ~ log(e_gdppc), data=_)

huxreg(model)
                 (1)
(Intercept)      0.278 **
                 (0.086)
log(e_gdppc)     0.112 **
                 (0.032)
N                25
R2               0.347
logLik           53.262
AIC              -100.524
*** p < 0.001; ** p < 0.01; * p < 0.05.

Problem: autocorrelation

plot(model)

Spurious correlation

Time trends can also be a source of spuriousness. Lots of things increase over time, so unrelated trending variables will look significantly related in a naive model.

vdem|>
ggplot(aes(x=log(e_pop), y=v2x_polyarchy)) +
  geom_line()  +
  geom_smooth(method='lm', se=FALSE)+
  facet_wrap(~country_name) +
  theme_minimal() +
  xlab('logged population') +
  ylab('polyarchy score')

Seasonality

Failure to account for cycles can also lead to poor predictions in the face of complex trends. The overall trend in total airline flights looks like this:

flights <- readRDS("C:/Users/neilb/Documents/GVPT728_Winter24/flights.RData")|>
  mutate(date =as.Date(date),
         year = year(date),
         quarter = quarter(date)
         )
ggplot(flights, aes(x=date, y=total_flights)) + 
  geom_line() +
  theme_minimal() + 
  geom_smooth(method ='lm', se=FALSE) +
  ylab("total flights") +
  xlab('date') 

Seasonality

… but we do a poor job of capturing that if we’re only looking at a few years of data:

ggplot(flights, aes(x=date, y=total_flights)) + 
  geom_line() +
  theme_minimal() + 
  geom_smooth(method ='lm', se=FALSE) +
  ylab("total flights") +
  xlab('date') +
  facet_wrap(~year, scales='free')

Potential fixes: more fixed or random effects

  • We could treat time as a factor and include it as a fixed effect (but we can’t do time AND case fixed effects because then nothing varies!)

  • We could treat time as a random effect term

  • We could control for time linearly, or with a polynomial or spline function

Potential fixes: differencing

  • If current values depend on prior values, taking the first difference should remove that dependence

  • difference = current_value - prior_value
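In R, diff() or dplyr::lag() both implement this on a toy vector:

```r
x <- c(10, 12, 15, 14)

diff(x)            # 2  3 -1
x - dplyr::lag(x)  # NA  2  3 -1 (keeps the original length)
```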

Differences

vdem|>
  filter(country_name=='Paraguay')|>
  mutate(polyarchy = v2x_polyarchy - dplyr::lag(v2x_polyarchy),
         group='differenced'
         )|>
  bind_rows(vdem|>filter(country_name=="Paraguay")|>
              mutate(group='original',
            polyarchy = v2x_polyarchy)
            )|>
ggplot(aes(x=year, y=polyarchy, color=group)) +
  geom_line()  +
  facet_wrap(~group) +
  theme_minimal() +
  ylab('Paraguay democracy score')

Differences

Taking the difference of both polyarchy and GDP in this model should eliminate the time trend effect. But it also alters what we’re actually regressing.

differenced_model<-vdem|>
  filter(country_name=='Paraguay')|>
  mutate(diff_polyarchy = v2x_polyarchy - dplyr::lag(v2x_polyarchy),
         diff_gdp = log(e_gdppc) - log(dplyr::lag(e_gdppc))
         
         )|>
  filter(year>=1995)|>
  lm(diff_polyarchy ~ diff_gdp, data=_)

huxreg("original_model"= model, "differenced_model" =differenced_model)
                 original_model   differenced_model
(Intercept)      0.278 **         0.003
                 (0.086)          (0.003)
log(e_gdppc)     0.112 **
                 (0.032)
diff_gdp                          0.002
                                  (0.115)
N                25               25
R2               0.347            0.000
logLik           53.262           75.351
AIC              -100.524         -144.701
*** p < 0.001; ** p < 0.01; * p < 0.05.

Differences

plot(differenced_model)

Differences: change vs. level

A change in the rate of change of GDP will lead to a change in the rate of change of the democracy score.

huxreg("differenced_model" =differenced_model)
                 differenced_model
(Intercept)      0.003
                 (0.003)
diff_gdp         0.002
                 (0.115)
N                25
R2               0.000
logLik           75.351
AIC              -144.701
*** p < 0.001; ** p < 0.01; * p < 0.05.

Differencing pros and cons

  • Pro: can account for the primary time series problem without adding any new parameters

  • Con: is actually a different model! Estimates changes, not levels!

  • Con: transforms the data in a way that makes it difficult to get predictions

Potential fixes: lagged variable model

Another option is to include lagged values of the variables as covariates. A model that includes the lagged dependent variable is typically referred to as an autoregressive model.

ar_model<-vdem|>
  filter(country_name=='Paraguay')|>
  arrange(year)|>
  mutate(lag_polyarchy = dplyr::lag(v2x_polyarchy),
         lag_gdp = dplyr::lag(e_gdppc)
         
         )|>
  filter(year>=1995)|>
  lm(v2x_polyarchy ~ lag_polyarchy  + log(e_gdppc)  + log(lag_gdp) , data=_)

AR model

huxreg('autoregressive' = ar_model)
                 autoregressive
(Intercept)      0.110 *
                 (0.046)
lag_polyarchy    0.896 ***
                 (0.099)
log(e_gdppc)     0.169
                 (0.140)
log(lag_gdp)     -0.188
                 (0.139)
N                25
R2               0.913
logLik           78.417
AIC              -146.833
*** p < 0.001; ** p < 0.01; * p < 0.05.

AR model: multiple lags

One advantage of this approach is that you can simultaneously test multiple lags, which helps if you think there’s more than one source of trend.

ar_model2<-vdem|>
  filter(country_name=='Paraguay')|>
  mutate(lag_polyarchy = dplyr::lag(v2x_polyarchy),
         lag_gdp = dplyr::lag(e_gdppc),
         lag_gdp2 = dplyr::lag(e_gdppc, 2),
         lag_polyarchy2= dplyr::lag(v2x_polyarchy, 2),
         )|>
  filter(year>=1995)|>
  lm(v2x_polyarchy ~ lag_polyarchy  + lag_polyarchy2+
       log(e_gdppc)  + 
       log(lag_gdp)  + 
       log(lag_gdp2), data=_)

huxreg(ar_model2)
                 (1)
(Intercept)      0.129 *
                 (0.047)
lag_polyarchy    1.192 ***
                 (0.199)
lag_polyarchy2   -0.387
                 (0.208)
log(e_gdppc)     0.112
                 (0.149)
log(lag_gdp)     0.096
                 (0.260)
log(lag_gdp2)    -0.216
                 (0.157)
N                25
R2               0.936
logLik           82.297
AIC              -150.593
*** p < 0.001; ** p < 0.01; * p < 0.05.

AR model: pros and cons

  • Pro: allows a much more complex model specification and also tests for autocorrelation

  • Con: interpretation is weird, and it adds a lot of parameters and inevitable multicollinearity

Other fixes

  • Including a fixed effect for time can work if you have multiple observations for different periods or if you only suspect a cyclical trend

  • More complicated autocorrelation structures can be modeled with a regression, and then OLS can be used on the residuals from that model
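As a sketch of that idea (on simulated data), nlme::gls can estimate a regression with an explicit AR(1) error structure in one step:

```r
library(nlme)

# Simulated data with AR(1) errors
set.seed(2)
n <- 100
d <- data.frame(t = 1:n, x = rnorm(n))
d$y <- 1 + 0.5 * d$x + as.numeric(arima.sim(list(ar = 0.7), n = n))

# GLS with an AR(1) correlation structure indexed by time
gls_model <- gls(y ~ x, data = d, correlation = corAR1(form = ~ t))
summary(gls_model)$tTable
```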

ACF plots

ACF plots visualize autocorrelations at successive lags. If a lagged correlation extends beyond the blue dashed lines, there’s statistically meaningful autocorrelation at that lag.

vdem|>
  filter(country_name=='Paraguay')|>
  arrange(year)|>
  filter(year >1990)|>
  mutate(diff_gdp = log(e_gdppc) - log(dplyr::lag(e_gdppc)))|>
  filter(!is.na(diff_gdp))|>
  with(acf(diff_gdp, lag.max = 5, plot = T))

vdem_p<-vdem|>
  filter(country_name=='Paraguay')|>
  arrange(year)|>
  filter(year >1990)|>
  mutate(log_gdp = log(e_gdppc))|>
  filter(!is.na(e_gdppc))|>
  with(acf(log_gdp, lag.max = 5, plot = T))

ACF plots: seasonality

This can make it much easier to spot seasonal trends as well:

with(flights, acf(total_flights, lag.max=12, plot=T))

ACF plots of residuals

acf(differenced_model$residuals, lag.max=12, plot=T)

acf(ar_model$residuals, lag.max=12, plot=T)

library(lme4)
library(tidyverse)  # dplyr, purrr, tidyr, ggplot2
library(broom)      # tidy() for model output

# Simulation: partial pooling of group intercepts.
# Nine common groups have intercepts near 20; one rare group has intercept 5.
set.seed(100)
N<-1000
ngroups<-10
group_intercepts<-c(rnorm(ngroups-1, mean=20, sd=1), 5)
names(group_intercepts) <-letters[1:ngroups]

groups<-sample(letters[1:ngroups], 
               replace=T, 
               prob = c(rep(.1, 9), .01),
               size=N)



x<-rnorm(N)
y<-group_intercepts[groups] + rnorm(N)
df<-data.frame(groups, x, y)

groupcounts<-df|>count(groups)
  


complete<-lm(y ~ 1, data=df)|>tidy(conf.int=TRUE)|>
  mutate(groups = 'pooled')

est<-complete$estimate[which(complete$term=="(Intercept)")]
nopool<- df|>
  group_by(groups)|>
  nest()|>
  mutate(fit = map(data, ~tidy(lm(y ~ 1, data=.x), conf.int=TRUE))
       
         )|>
  unnest(fit)|>
  select(-data)


pp<-lmer(y ~ 0 + (1|groups), data = df)

rfx<-data.frame(ranef(pp))|>
  mutate(grp = paste(grp))
#rfx_ci<-confint(pp, method = "boot", nsim = 500, boot.type = "perc")


data<-bind_rows(complete, nopool)|>
  mutate(type = factor(groups == "pooled", levels=c(TRUE, FALSE), 
                       labels = c("pooled", "not pooled")))



data|>
  filter(term == "(Intercept)")|>
  arrange(groups)|>
  left_join(groupcounts)|>
  ggplot(aes(x=groups, y=estimate, color=type, 
             ymin = conf.low, ymax=conf.high, label=n)) + 
  geom_point()  + 
  geom_hline(yintercept = est, lty=2) +
  geom_pointrange() +
  scale_color_manual(values=c('red', 'black')) +
  theme_bw() +
  geom_point(data=rfx, aes(x=grp, y=condval), 
             #position = position_nudge(x=0.5),
             color = 'blue',
             inherit.aes = FALSE) +
  coord_flip()